{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Sampling and Hypothesis Testing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sampling | Confidence intervals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "set_plot_dimensions <- function(width_choice, height_choice) {\n",
    "    options(repr.plot.width=width_choice, repr.plot.height=height_choice)\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Standard normal distribution\n",
    "\n",
    "Consider a normal distribution with mean 0 and standard deviation 1:\n",
    "\n",
    "$$Z \\sim N(0,1)$$\n",
    "\n",
    "$Z$ is known as the *standard normal distribution*.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a standard normal distribution\n",
    "mu <- 0\n",
    "sigma <- 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "\n",
    "wid <- 0.01\n",
    "x <- seq(-4,4,wid)\n",
    "pdf <- dnorm(x,mu,sigma)\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, pdf, xlab=\"x\", type=\"l\", col=\"red\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the cumulative distribution function\n",
    "\n",
    "cdf <- pnorm(x,mu,sigma)\n",
    "\n",
    "set_plot_dimensions(5, 4)\n",
    "plot(x, cdf, xlab=\"x\", ylim=c(0,1), type=\"l\", col=\"blue\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How much of the probability mass lies within one, two, or three standard deviations of the mean? (Hint: use the cdf!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# one sigma\n",
    "pnorm(1,mu,sigma) - pnorm(-1,mu,sigma)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# two sigma\n",
    "pnorm(2,mu,sigma) - pnorm(-2,mu,sigma)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# three sigma\n",
    "pnorm(3,mu,sigma) - pnorm(-3,mu,sigma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are often interested in the regions containing a certain percentage of the probability mass, for example, find $z_{90}$ such that\n",
    "\n",
    "$$\\mathbb{P}(-z_{90} < z < z_{90}) = 0.9$$\n",
    "\n",
    "R makes it easy for us to find these *critical values* of $z$, using the *quantile* function `qnorm` e.g."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alpha <- 0.1  # how much probability will remain in the tails\n",
    "\n",
    "qnorm(alpha/2,mu,sigma)  # the critical value for the lower tail\n",
    "qnorm(1-alpha/2,mu,sigma) # the critical value for the upper tail = z90"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the probability density function\n",
    "plot(x, pdf, xlab=\"x\", type=\"l\", col=\"red\")\n",
    "\n",
    "# the shaded region contains 90% of the probability mass\n",
    "z90 <- qnorm(1-alpha/2,mu,sigma)\n",
    "x_region = seq(-z90,z90,wid)\n",
    "polygon(c(x_region,z90,-z90), c(dnorm(x_region,mu,sigma),0,0), border=NA, col=\"lightgrey\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the corresponding critical values for 95% and 99% of the probability mass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#95%\n",
    "qnorm(1-0.05/2,mu,sigma)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#99%\n",
    "qnorm(1-0.01/2,mu,sigma)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Z-score\n",
    "\n",
    "Note that we can *standardise* any random variable by subtracting the mean and dividing by the standard deviation. When applied to a single observation, this is known as the *z-score*:\n",
    "\n",
    "$$z =\\frac{x-\\mu}{\\sigma}$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confidence intervals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best single estimate of the population mean $\\mu$ based on a sample of data is simply the sample mean, $\\bar{x}$.\n",
    "\n",
    "However, it is often useful to describe the precision of our estimate by giving a *confidence interval* instead of a point estimate.\n",
    "\n",
    "We say that the true mean lies within the interval with e.g. 90% confidence.\n",
    "\n",
    "From the central limit theorem, $\\bar{X} \\sim N(\\mu,\\sigma^2/n)$\n",
    "\n",
    "When $n$ is large (>30), we have $\\sigma \\approx s$, so $\\bar{X} \\sim N(\\mu,s^2/n)$\n",
    "\n",
    "Standardisation gives \n",
    "$$z = \\frac{\\bar{x}-\\mu}{s/\\sqrt{n}}$$\n",
    "\n",
    "From the above critical value calculation, the 90\\% confidence interval is where $-1.64 < z < 1.64$\n",
    "\n",
    "Substituting for $z$, we have\n",
    "\n",
    "$$\\mathbb{P}\\left(\\bar{x}-1.64\\frac{s}{\\sqrt{n}} < \\mu < \\bar{x}+1.64\\frac{s}{\\sqrt{n}}\\right) =  0.9$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: speed of light"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below are the results of 64 measurements of the speed of light made by Simon Newcomb in 1882, in km/s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data <- c(265848, 256680, 248124, 310155, 265848, 201182, 232617,\n",
    "       206770, 275694, 286297, 265848, 256680, 286297, 275694,\n",
    "       338351, 323640, 372187, 297749, 297749, 206770, 323640,\n",
    "       240120, 232617, 310155, 275694, 225568, 465233, 310155,\n",
    "       256680, 206770, 354463, 265848, 286297, 275694, 275694,\n",
    "       232617, 297749, 265848, 310155, 186093, 354463, 240120,\n",
    "       232617, 265848, 286297, 248124, 275694, 286297, 310155,\n",
    "       232617, 256680, 218933, 297749, 391775, 206770, 256680,\n",
    "       248124, 338351, 265848, 225568, 190865, 297749, 465233,\n",
    "       323640)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on these data, calculate a 95% confidence interval for the speed of light."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# first, calculate the sample mean and s.d.\n",
    "xbar <- mean(data)\n",
    "xbar\n",
    "\n",
    "s <- sd(data)  ## sd() uses the unbiased estimator\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now find the critical value\n",
    "z95 <- qnorm(1-0.05/2,mu,sigma)\n",
    "z95"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# finally calculate the 95% confidence interval for mu\n",
    "ci <- c(xbar - z95 * s / sqrt(64) , xbar + z95 * s / sqrt(64))\n",
    "ci"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confidence intervals for small $n$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The central limit theorem only applies when $n$ is large. For small samples ($n$<30), the normal distribution is not a good approximation to the sampling distribution of the mean, because our estimate of the population variance is likely to be poor.\n",
    "\n",
    "However, for situations where the *population* is expected to be normally distributed, we can use **Student's t-distribution** to construct an appropriate confidence interval. This has a broader bell-curve, which reflects our lack of knowledge about the population variance.\n",
    "\n",
    "The t-distribution takes a single parameter, which is $\\nu$, the number of degrees of freedom. We set this as $n-1$.\n",
    "\n",
    "For high values of $\\nu$, the t-distribution converges to the normal distribution.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "x <- seq(-4,4,0.01)\n",
    "plot(x,dnorm(x,mean=0,sd=1), xlab=\"x\", type=\"l\", col='grey')\n",
    "lines(x, dt(x, df=1), col=\"red\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: counting beetles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have been asked to measure the density of a particular species of beetle in a field.\n",
    "\n",
    "The following are the observed numbers in 10 samples using a 1 m<sup>2</sup> [quadrat](https://en.wikipedia.org/wiki/Quadrat)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data <- c(5, 1, 6, 3, 2, 4, 2, 7, 1, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate a 95% confidence interval for the beetle density."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# first, calculate the sample mean and s.d.\n",
    "xbar <- mean(data)\n",
    "xbar\n",
    "\n",
    "s <- sd(data)\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now find the critical value\n",
    "t95 <- qt(1-0.05/2,df=9)  # df = n-1\n",
    "t95"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# finally calculate the 95% confidence interval for mu\n",
    "ci <- c(xbar - t95 * s / sqrt(10) , xbar + t95 * s / sqrt(10))\n",
    "ci"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
